- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25
Add Muon #203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add Muon #203
Conversation
        
          
                src/rules.jl
              
                Outdated
          
        
      | X = G / (norm(G) + eps(T)) | ||
| transposed = size(G, 1) > size(G, 2) | ||
| if transposed | ||
| X = X' | ||
| end | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like re-using X for different things, and sometimes the compiler feels the same way:
| X = G / (norm(G) + eps(T)) | |
| transposed = size(G, 1) > size(G, 2) | |
| if transposed | |
| X = X' | |
| end | |
| X1 = G / (norm(G) + eps(T)) | |
| transposed = size(G, 1) > size(G, 2) | |
| X2 = transposed ? transpose(X1) : X1 | 
But also, I don't like creating type instability. So even better, the inner loop could be pulled out to another function:
| X = G / (norm(G) + eps(T)) | |
| transposed = size(G, 1) > size(G, 2) | |
| if transposed | |
| X = X' | |
| end | |
| X = G / (norm(G) + eps(T)) | |
| if size(G, 1) <= size(G, 2) | |
| _innerloop(X, a, b, c) | |
| else | |
| transpose(_innerloop(transpose(X), a, b, c)) | |
| end | |
| end | |
| function _innerloop(X, a, b, c) | |
| for _ in 1:5 | |
| A = X * X' | 
| - Fallback function (`fallback`): Function to control when, in addition to 1D arrays, the fallback optimizer should be used. Will be passed the parameter array and must return a boolean. | ||
| Note: Works best with large batch sizes and may not be suitable for fine-tuning. | ||
| In nanoGPT speedrun experiments, Muon is used for the internal layer >2D weights, and AdamW is used for the 1D weights, embeddings, and heads. | ||
| `Optimisers.adjust!(optimiser_state, η::Real)` will adjust the fallback optimizer's `eta` to `η * (opt.eta / eta)`, and Muon's `eta` to `η`, preserving their ratio, | ||
| but `Optimisers.adjust!(optimiser, eta = η)` will only adjust Muon's learning rate (allowing you to adjust the fallback optimizer's learning rate separately). | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What it seems you really want is for setup to use Muon on some arrays, and AdamW on others. But instead this is rolled into this particular meta-optimisation rule, which is also called Muon. Maybe we should think about how to do that in a bit more generality?
One spelling would be setup(OptimiserIfElse(fun, AdamW(), Muon()), model) which does fun(x) ? init(AdamW(), x) : init(Muon(), x), with some stuct OptimiserIfElse <: AbstractRule. But it's a "fake rule" which is digested at setup time.
Another would be setup(fun::Function, model::Any) which is like
opt_state = setup(model) do x
  ndims(x) == 1 ? AdamW() : Muon()
endI'm sure we batted around such ideas when writing this package, but nobody had a concrete need. One thing we wondered was whether this which-rule function ought to see just x or, for instance, the field name, or the layer's type, or what? ndims(x) == 1 is a way of selecting bias but can't distinguish weight matrices from different layers.
Edit, a 3rd way is just to let you handle it. Too obscure, even if documented? Not sure.
opt_state = fmapstructure(model; exclude=isnumeric) do x
  rule = ndims(x) == 1 ? AdamW() : Muon()
  setup(rule, x)
endThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly agree that this functionality would be good to have in general, but as you say it isn't quite clear what to switch on.
But in this case, when people say "Muon" they mean "Muon for >2D with an AdamW fallback for 1D" - see https://github.com/KellerJordan/Muon/blob/master/muon.py
And the only things they use to switch are directly inferable from the tensor itself, which we have access to inside the optimiser so we can get the same behavior. So I think it makes sense to, here, just call this Muon? Also, it supports the same (as the python version) way of adjusting the two eta values during warmup/cooldown (retaining their ratio) saving the user a little effort if they want to use it as-is.
| FYI I've made this: https://github.com/MurrellGroup/CannotWaitForTheseOptimisers.jl which can house new optimisers until you folks are happy to have them included in Optimisers.jl (mostly because I really just need to be able to use them conveniently for now). | 
This adds Muon (https://kellerjordan.github.io/posts/muon/), which uses an approximate orthogonalization before the update. There isn't a publication, but it gave a key improvement in the nanoGPT training "speedrun" attempt: https://github.com/KellerJordan/modded-nanogpt
This was started by @mashu
PR Checklist